In this tutorial, we will be analyzing sample N1 from a dataset of tumor and immune cell populations from human early-stage lung adenocarcinomas (He et al., Oncogene 2021). Data has been obtained using 10X Genomics technology. The sample N1 is composed of 14,972 single cells that were sequenced on the Illumina Hiseq X platform. Raw data can be download from here. Prior to the analysis with popsicleR, raw data has been processed with Cell Ranger software to align reads and generate feature-barcode matrices. Cell Ranger data, comprising the gzipped TSV files of the feature-barcode matrix and of feature and barcode sequences, can be found here.
We start assigning a name to the sample (i.e., N1) and defining the path to the folder containing raw input data. popsicleR accepts as input either files from the Cell Ranger pipeline of 10X Genomics or a feature-barcode matrix of raw counts generated from any microfluidic-, microwell plates-, or droplet-based scRNA-seq technology. To organize input, it is convenient to create a directory named as the sample name and containing a sub-folder with the input data (either in the form of Cell Ranger gzipped TSV files or of a tab-delimited matrix). In this example, the Cell Ranger input files are stored in the cellranger_data folder. Some steps of the analysis that involve Seurat functions can be performed using parallelization following the instructions reported here.
library(popsicleR)
## Hello and welcome to popsicleR, an interactive Pipeline Of Preprocessing for SIngle CelL Experiment data.
## In this workflow, messages are colour coded:
## green messages will provide information on the ongoing step
## cyan messages will require the user to provide an input for advancing the script
## red messages will report missing information or wrong inputs
## grey messages report functions and software system outputs.
# set the working directory and sample name
sample.name <- "N1"
main.dir <- file.path("~", "PoPsicleExample", sample.name)
input.data.dir <- file.path(main.dir, "cellranger_data")
setwd(main.dir)
We start the analysis using the PrePlots function. PrePlots first reads the output of the cellranger pipeline from 10X through the Read10X() function of the Seurat package and returns a unique molecular identified (UMI) count matrix. The values in the count matrix represent the number of molecules for each feature (i.e., gene; row) that are detected in each cell (column). During the generation of the count matrix, a soft filter is applied to remove low quality cells. By default, PrePlots removes genes that are expressed in less than 10% of cells (percentage = 0.1) and cells that express less than 200 genes (gene_filter = 200). Additional parameters specify the organism (organism = "human") and if data has been generated using cellranger (cellranger = TRUE). If cellranger is set to FALSE, input_data must be the name of a feature-barcode matrix file. PrePlots returns violin (Figure 1), density (Figure 2), and scatter (Figure 3) plots to easily explore QC metrics and set user-defined filtering thresholds. These plots report commonly used QC metrics as the number of unique genes detected in each cell (nFeature_RNA), the total number of molecules detected within a cell (nCount_RNA), the percentage of reads that map to the mitochondrial genome (percent_mt), and to ribosomal (percent_ribo) and dissociation genes (percent_disso). To help the user avoiding the removal of cells intrinsically characterized by outlying levels of unique genes, total UMIs, and mitochondrial fractions, PrePlots also returns the cell abundance and the distributions of these features within specific cell populations identified by a list of user-defined marker genes (e.g., SFTPC for alveolar cells and CD3D for T cells; Figure 4-5). All plots are saved in the 01.QC_Plots folder within the working directory.
# Define a list of cell-type markers populations.markers <-
# c('EPCAM','VIM','COL1A1','PECAM1','PTPRC','CD3D','CD14','SFTPC') sample.umi
# <- PrePlots(sample = sample.name, input_data = input.data.dir, genelist =
# populations.markers, percentage = 0.1, gene_filter = 200, cellranger = TRUE,
# organism = 'human', out_folder = main.dir) Create a Seurat object from the
# raw data and visualize QC metrics
Figure 1. Violin plots.
Figure 2. Density plots.
Figure 3. Scatter plots.
Figure 4. Density plots for cells expressing CD3D, a marker of T lymphocytes.
Figure 5. Scatter plots for cells expressing CD3D, a marker of T lymphocytes.
Appropriate custom filters to discard low-quality cells can be applied using the FilterPlots function. Cells can be filtered based on the minimum and maximum number of genes detected in each cell (G_RNA_low and G_RNA_hi), the minimum and maximum number of molecules detected within a cell (U_RNA_low and U_RNA_hi), and the maximum percentage of mitochondrial, ribosomal, and dissociation genes (percent_mt_hi, percent_ribo_hi and percent_disso_hi, respectively). Once thresholds are applied, FilterPlots returns summary graphs with data distributions and correlations that highlight with colors cells passing the filtering procedure (Figure 6). Filtering thresholds are displayed as gray lines. All plots are saved in the 01.QC_Plots folder.
# Filter the Seurat object based on QC metrics sample.umi <- FilterPlots(UMI =
# sample.umi, G_RNA_low = 500, G_RNA_hi= Inf, U_RNA_low = -Inf, U_RNA_hi =
# 37000, percent_mt_hi = 10, percent_ribo_hi = 100, percent_disso_hi = 100,
# out_folder = main.dir)
Figure 6. Post filtering scatter plots. Cells passing the filtering procedure are highlighted with colors.
As an additional QC and filtering step, the function CalculateDoublets allows detecting and, eventually, removing cell doublets, i.e., groups of cells captured and associated to the same unique barcode. CalculateDoublets is the R translation of the python module Scrublet (Wolock et al., Cell Syst 2019),implemented based on the code provided by Chengxian Qiu at https://rdrr.io/github/ChengxiangQiu/rscrublet/. The function returns histograms of observed and simulated doublet score distributions, the value of the automatically detected threshold for the simulated doublet score (Figure 7) and the visualization of the cell doublets and of the doublet scores on a UMAP projection (Figure 8). All plots are saved in the 01.QC_Plots folder. It is worth noting that CalculateDoublets can take several minute to complete.
# Doublet detection sample.umi <- CalculateDoublets(UMI = sample.umi, dbs_thr
# ='none', dbs_remove = FALSE, out_folder = main.dir)
Figure 7. Histograms of observed and simulated doublet score distributions. The solid line indicates the automatically detected threshold for the simulated doublet score.
Figure 8. Visualization of the cell doublets and of the doublet scores on a UMAP projection.
The threshold for the simulated doublet score can be manually modified setting the dbs_thr based on visual inspection of the observed and simulated doublet score distributions and of the UMAP projection. Doublets can be filtered setting the dbs_remove argument to TRUE. In this example, we changed the automatically detected threshold setting dbs_thr = 0.22 and remove doublets.
# Doublet removal sample.umi <- CalculateDoublets(UMI = sample.umi, dbs_thr =
# 0.22, dbs_remove = TRUE, out_folder = main.dir)
After removing unwanted cells, data are normalized using the Normalize function. By default, Normalize implements the global-scaling normalization method “LogNormalize” of the Seurat NormalizeData function. LogNormalize normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. The Normalize function also calculates a subset of features that exhibit high cell-to-cell variation in the dataset (i.e., that are highly expressed in some cells, and lowly expressed in others) and that will be used in downstream analysis, like PCA dimensional reduction. This step is based on the FindVariableFeatures function of the Seurat package (with default of 2,000 features, i.e., variable_genes = 2,000). Signal distributions before and after normalization and plots highlighting the highly variable genes are saved in the 02.PreProcessing folder.
# Normalization sample.umi <- Normalize(UMI = sample.umi, variable_genes =
# 2000, out_folder = main.dir)
Before dimensionality reduction, scaling and, eventually, regression must be performed. Scaling is required for PCA and makes the data to have zero-mean and unit standard deviation (standardization). Regression is used to mitigate the effect of unwanted sources of variation, as experimental and biological biases. One source of biological variation is represented by cell cycle that is quantified before scaling or regression. This quantification exploits the method implemented in the Seurat CellCycleScoring function to assign each cell with cell cycle scores and a predicted cell cycle phase, based on the expression of G2/M and S phase markers. ApplyRegression simply scales the data, if no variable to regress is indicated (default, variables = "none"), or performs regression on several variables specified by the user. The variables argument can be a combination of, e.g., the number of genes (nFeature_RNA), UMIs (nCount_RNA), percentage of mithocondrial genes (percent_mt), and the cell cycle scores (S.Score and G2M.Score). The function uses the functionality of Seurat ScaleData. ApplyRegression returns projections on reduced spaces (PCA, t-SNE, and UMAP using 20 principal components) to inspect the effect of variable regression (Figure 9-10). If the explore_PC is set to TRUE, once completed scaling and, eventually, regression, ApplyRegression performs PCA on the scaled (or regressed) data and returns heatmap (Seurat DimHeatmap), jackstraw (Seurat JackStrawPlot), and elbow (Seurat ElbowPlot) plots to determine the optimal number of principal components to include in the downstream analyses. All plots are saved in the 02.PreProcessing folder and in sub-folders named after the regressed variables. In this example, we first simply scale the data (i.e., variables = "none") and then perform regression on cell cycle scores (i.e., variables = c("S.Score", "G2M.Score")).
# Regression sample.umi <- ApplyRegression(UMI = sample.umi, organism =
# 'human', variables = 'none', explore_PC = FALSE, out_folder = main.dir)
Figure 9. Visualization of the number of genes, UMIs, percentage of mithocondrial genes, and the cell cycle scores on the PCA embedding after scaling.
# Regression sample.umi <- ApplyRegression(UMI = sample.umi, organism =
# 'human', variables = c('S.Score', 'G2M.Score'), explore_PC = FALSE,
# out_folder = main.dir)
Figure 10. Visualization of the number of genes, UMIs, percentage of mithocondrial genes, and the cell cycle scores on the PCA embedding after regression on the cell cycle scores.
Once inspected the post-regression PCA, t-SNE, and UMAP plots, we decide to continue with scaling only and to generate the plot to determine the optimal number of principal components (‘explore_PC = TRUE’).
# Regression sample.umi <- ApplyRegression(UMI = sample.umi, organism =
# 'human', variables = 'none', explore_PC = TRUE, out_folder = main.dir)
The CalculateCluster function exploits the graph-based clustering approach implemented through the FindNeighbors and FindClusters functions of Seurat using as input the previously identified number of principal components (dim_pca) and the resolution parameter (cluster_res) that sets the granularity of the clustering, with increasing values leading to a greater number of clusters. The resolution parameter can be either a single value or a set of values to explore cell grouping at different resolutions. Cell clusters are visualized in dimensionality reduced planes as UMAP or t-SNE embeddings (Figure 11), while a phylogenetic tree is used to show correlation between the identified clusters (Figure 12). Beside cluster membership, in UMAP and t-SNE embeddings, cells are colored according to several features that might aid in the assessment of cluster quality or in the identification of clusters of problematic cells that escaped the previous filtering steps (Figure 13). If multiple resolutions have been selected, the clustree package is used to produce a clustering tree that shows how cells are assigned to the clusters at the various clustering resolutions (Figure 14). Finally, CalculateCluster returns a dot plot for a set of user-defined genes (marker.list argument; default is a set of genes marking immune cells) in the clusters identified at the given resolutions (Figure 15). All plots are saved in the 03.Clustering folder. In this example we select dim_pca = 12 principal components, a cluster resolution varying from 0.4 to 0.8 (i.e.,cluster_res =c(0.4,0.6,0.8)) and default marker genes (marker.list = "none").
# Clustering at multiple resolutions sample.umi <- CalculateCluster(UMI =
# sample.umi, dim_pca = 12, organism = 'human', marker.list = 'none', PCA =
# TRUE, cluster_res = c(0.4,0.6,0.8), out_folder=main.dir)
Figure 11. UMAP visualization of cells colored by cluster membership at resolution 0.8. Numbers indicate the cluster identity.
Figure 12. Phylogenetic tree showing correlation between the identified clusters at resulution 0.8.
Figure 13. UMAP visualization of cells colored by percent of ribosomal and dissociation genes and MALAT1 and GAPDH expression levels.
Figure 14. Clustering tree displaying how cells move among clusters as resolution increases from 0.4 to 0.8. Cluster identity is displayed inside the circle; circle color indicates the clustering resolution; circle size is proportional to the number of cells in the cluster. Edge color is related to the number of cells whereas its transparency shows how many cells of the incoming node are assigned to the destination node.
Figure 15. Dot plot reporting, in the clusters identified at resolution 0.8, the expression of genes commonly associated to general immune populations. Dot size is proportional to the fraction of cells with non-zero expression of the marker gene. Dots are color-coded based on the average scaled normalized expression.
In case a single resolution is indicated, CalculateCluster uses the FindAllMarkers function of Seurat to determine the differentially expressed genes in the comparison between cells in each cluster and all remaining cells (cluster markers). FindAllMarkers is applied with the minimum fraction of cells min.pct in either of the two populations set to 0.25 and limiting the test to genes showing, on average, at least logfc.threshold = 0.25 fold difference between the two groups of cells. Expression values of the top 10 markers of each cluster are used to generate a heatmap and the expression of the top 2 markers is visualized through t-SNE, UMAP, and violin plots (Figure 16). In this example we select a cluster resolution equal to 0.8 (i.e.,cluster_res =0.8) and a list of generic markers for immune cells (markers = "Immune_cells_human.txt") saved in the current directory as a tab delimited file (available here).
# Clustering at a single resolution sample.umi <- CalculateCluster(UMI =
# sample.umi, dim_pca = 12, organism = 'human', marker.list =
# 'Immune_cells_human.txt', PCA = TRUE, cluster_res = 0.8, out_folder=main.dir)
Figure 16. Violin plot of the top 2 marker genes for cluster 7 at resolution 0.8.
The assignment of cell type identity to clusters is achieved using the MakeAnnotation function. MakeAnnotation annotates cells using SingleR and scMCA built-in references. Specifically, SingleR is used to annotate data from human and mouse samples with the references of the celldex package. For human samples, celldex includes the Human Primary Cell Atlas (HPCA) and the Blueprint Encode (BpEn) references generated from microarrays of human primary cells and RNA-seq profiles of human stroma and immune cells, respectively. For mouse samples, references are derived from a collection of mouse bulk RNA-seq datasets (Mouse RNA-seq) and from the microarray profiles of pure mouse immune cells provided by the Immunological Genome Project (ImmGen). scMCA is used to annotate data from mouse samples with a reference constructed from the single-cell Mouse Cell Atlas (scMCA) and covering all mouse cell types. Single cell annotations for the different references are displayed color-coding cells in t-SNE and UMAP embeddings. Since annotation can be performed either at the cluster or at the single cell level, cells can be colored based on clusters (at single or multiple resolution setting the cluster_res parameter; Figure 17) or on cell type identity (Figure 18). In the annotation at single cell level, it can be difficult to discern the different populations in the low-dimensional embeddings, as cells are overlapping, and the color scales might be inefficient in separating the various cell types. To overcome this limitation, MakeAnnotation produces a plot for each identified population with only cells of the given type highlighted in color on top of the other cells colored in gray (Figure 19). Finally, MakeAnnotation returns dot plots for a set of user-defined genes in the predicted cell populations (marker.list argument as a tab-delimited file with headers saved in the working directory; default is a set of genes marking immune cells; Figure 20) and a pie-chart plot showing the clusters composition according to the various annotations (Figure 21). All plots are saved in the 04.Annotations folder.
# Cell type annotation sample.umi <- MakeAnnotation(UMI = sample.umi, organism
# = 'human', marker.list = 'none', cluster_res= 0.8, out_folder = main.dir)
Figure 17. UMAP visualization of cell clusters colored by population label assigned using the Blueprint ENCODE reference atlas.
Figure 18. UMAP visualization of single cells colored by population label assigned using the Blueprint ENCODE reference atlas.
Figure 19. UMAP visualization of single cells (annotated using the Blueprint ENCODE reference atlas) with CD8+ T cells highlighted in red on top of the other cells colored in gray.
Figure 20. Dot plot reporting, in the cell types assigned using the Blueprint ENCODE reference atlas, the expression of genes commonly associated to general immune populations. Dot size is proportional to the fraction of cells with non-zero expression of the marker gene. Dots are color-coded based on the average scaled normalized expression.
Figure 21. Pie-chart plot displaying, for each cluster at resolution 0.8, the proportion of cells annotated with a specific label using the Blueprint ENCODE reference atlas.
Finally, the Seurat object that includes all the calculated cell features in the metadata slot can be saved in a dedicated output directory for downstream analyses.
output.data.dir <- file.path(main.dir, "RDS objects")
if (!file.exists(output.data.dir)) {
dir.create(output.data.dir)
}
# saveRDS(sample.umi,file =
# file.path(output.data.dir,paste0(sample.name,'.Rds')))
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] popsicleR_0.1.6 scMCA_0.2.0
## [3] celldex_1.0.0 umap_0.2.7.0
## [5] neldermead_1.0-11 optimsimplex_1.0-7
## [7] optimbase_1.0-9 Matrix_1.3-4
## [9] RANN_2.6.1 ggplotify_0.1.0
## [11] RColorBrewer_1.1-2 ggExtra_0.9
## [13] crayon_1.4.2 patchwork_1.1.1
## [15] limma_3.46.0 magrittr_2.0.1
## [17] gridExtra_2.3 session_1.0.3
## [19] future_1.23.0 SingleR_1.4.1
## [21] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [23] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
## [25] IRanges_2.24.1 S4Vectors_0.28.1
## [27] BiocGenerics_0.36.1 MatrixGenerics_1.2.1
## [29] matrixStats_0.61.0 gtools_3.9.2
## [31] ape_5.5 clustree_0.4.4
## [33] ggraph_2.0.5 corrplot_0.92
## [35] ggplot2_3.3.5 dplyr_1.0.7
## [37] R.utils_2.11.0 R.oo_1.24.0
## [39] R.methodsS3_1.8.1 reticulate_1.22
## [41] SeuratObject_4.0.4 Seurat_4.0.5
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 tidyselect_1.1.1
## [3] RSQLite_2.2.8 AnnotationDbi_1.52.0
## [5] htmlwidgets_1.5.4 BiocParallel_1.24.1
## [7] Rtsne_0.15 munsell_0.5.0
## [9] codetools_0.2-18 ica_1.0-2
## [11] DT_0.20 miniUI_0.1.1.1
## [13] withr_2.4.3 colorspace_2.0-2
## [15] highr_0.9 knitr_1.36
## [17] ROCR_1.0-11 tensor_1.5
## [19] listenv_0.8.0 GenomeInfoDbData_1.2.4
## [21] polyclip_1.10-0 pheatmap_1.0.12
## [23] bit64_4.0.5 farver_2.1.0
## [25] parallelly_1.29.0 vctrs_0.3.8
## [27] generics_0.1.1 xfun_0.28
## [29] BiocFileCache_1.14.0 R6_2.5.1
## [31] graphlayouts_0.7.2 rsvd_1.0.5
## [33] cachem_1.0.6 bitops_1.0-7
## [35] spatstat.utils_2.2-0 gridGraphics_0.5-1
## [37] DelayedArray_0.16.3 assertthat_0.2.1
## [39] promises_1.2.0.1 scales_1.1.1
## [41] gtable_0.3.0 beachmat_2.6.4
## [43] globals_0.14.0 goftest_1.2-3
## [45] tidygraph_1.2.0 rlang_0.4.12
## [47] splines_4.0.4 lazyeval_0.2.2
## [49] spatstat.geom_2.3-0 BiocManager_1.30.16
## [51] yaml_2.2.1 reshape2_1.4.4
## [53] abind_1.4-5 httpuv_1.6.3
## [55] tools_4.0.4 ellipsis_0.3.2
## [57] spatstat.core_2.3-2 jquerylib_0.1.4
## [59] ggridges_0.5.3 Rcpp_1.0.7
## [61] plyr_1.8.6 sparseMatrixStats_1.2.1
## [63] zlibbioc_1.36.0 purrr_0.3.4
## [65] RCurl_1.98-1.5 rpart_4.1-15
## [67] openssl_1.4.5 deldir_1.0-6
## [69] pbapply_1.5-0 viridis_0.6.2
## [71] cowplot_1.1.1 zoo_1.8-9
## [73] ggrepel_0.9.1 cluster_2.1.2
## [75] data.table_1.14.2 RSpectra_0.16-0
## [77] scattermore_0.7 lmtest_0.9-39
## [79] fitdistrplus_1.1-6 mime_0.12
## [81] evaluate_0.14 xtable_1.8-4
## [83] compiler_4.0.4 tibble_3.1.6
## [85] KernSmooth_2.23-20 htmltools_0.5.2
## [87] mgcv_1.8-38 later_1.3.0
## [89] tidyr_1.1.4 DBI_1.1.1
## [91] ExperimentHub_1.16.1 tweenr_1.0.2
## [93] formatR_1.11 dbplyr_2.1.1
## [95] rappdirs_0.3.3 MASS_7.3-54
## [97] igraph_1.2.9 pkgconfig_2.0.3
## [99] plotly_4.10.0 spatstat.sparse_2.0-0
## [101] bslib_0.3.1 XVector_0.30.0
## [103] yulab.utils_0.0.4 stringr_1.4.0
## [105] digest_0.6.28 sctransform_0.3.2
## [107] RcppAnnoy_0.0.19 spatstat.data_2.1-0
## [109] rmarkdown_2.11 leiden_0.3.9
## [111] uwot_0.1.10 DelayedMatrixStats_1.12.3
## [113] curl_4.3.2 shiny_1.7.1
## [115] lifecycle_1.0.1 nlme_3.1-153
## [117] jsonlite_1.7.2 BiocNeighbors_1.8.2
## [119] viridisLite_0.4.0 askpass_1.1
## [121] fansi_0.5.0 pillar_1.6.4
## [123] lattice_0.20-45 fastmap_1.1.0
## [125] httr_1.4.2 survival_3.2-13
## [127] interactiveDisplayBase_1.28.0 glue_1.5.1
## [129] shinythemes_1.2.0 png_0.1-7
## [131] BiocVersion_3.12.0 bit_4.0.4
## [133] ggforce_0.3.3 stringi_1.7.6
## [135] sass_0.4.0 blob_1.2.2
## [137] AnnotationHub_2.22.1 BiocSingular_1.6.0
## [139] memoise_2.0.0 irlba_2.3.3
## [141] future.apply_1.8.1